The ability to accurately predict gene
function based on gene sequence is an 
important tool in many areas of biologi- 
cal research. Such predictions have be- 
come particularly important in the ge- 
nomics age in which numerous gene se- 
quences are generated with little or no 
accompanying experimentally deter-
mined functional information. Almost
all functional prediction methods rely 
on the identification, characterization, 
and quantification of sequence similar- 
ity between the gene of interest and 
genes for which functional information 
is available. Because sequence is the 
prime determining factor of function, 
sequence similarity is taken to imply 
similarity of function. There is no doubt 
that this assumption is valid in most 
cases. However, sequence similarity does 
not ensure identical functions, and it is 
common for groups of genes that are 
similar in sequence to have diverse (al- 
though usually related) functions. 
Therefore, the identification of se- 
quence similarity is frequently not 
enough to assign a predicted function to 
an uncharacterized gene; one must have 
a method of choosing among similar 
genes with different functions. In such 
cases, most functional prediction meth-
ods assign likely functions by quantify-
ing the levels of similarity among genes.
I suggest that functional predictions can 
be greatly improved by focusing on how 
the genes became similar in sequence 
(i.e., evolution) rather than on the se- 
quence similarity itself. It is well estab- 
lished that many aspects of comparative 
biology can benefit from evolutionary 
studies (Felsenstein 1985), and compara- 
tive molecular biology is no exception 
(e.g., Altschul et al. 1989; Goldman et al.
1996). In this commentary, I discuss the 
use of evolutionary information in the 
prediction of gene function. To appreci- 
ate the potential of a phylogenomic ap- 
proach to the prediction of gene func- 
tion, it is necessary to first discuss how 
gene sequence is commonly used to pre-
dict gene function and some general fea-
tures of gene evolution.



Sequence Similarity, Homology, 
and Functional Predictions 

To make use of the identification of se- 
quence similarity between genes, it is 
helpful to understand how such similar- 
ity arises. Genes can become similar in 
sequence either as a result of convergence 
(similarities that have arisen without a 
common evolutionary history) or de- 
scent with modification from a com- 
mon ancestor (also known as homology). 
It is imperative to recognize that se- 
quence similarity and homology are not 
interchangeable terms. Not all ho- 
mologs are similar in sequence (i.e., ho- 
mologous genes can diverge so much 
that similarities are difficult or impos-
sible to detect) and not all similarities
are due to homology (Reeck et al. 1987; 
Hillis 1994). Similarity due to conver- 
gence, which is likely limited to small 
regions of genes, can be useful for some 
functional predictions (Henikoff et al. 
1997). However, most sequence-based 
functional predictions are based on the 
identification (and subsequent analysis) 
of similarities that are thought to be due 
to homology. Because homology is a 
statement about common ancestry, it 
cannot be proven directly from se- 
quence similarity. In these cases, the in- 
ference of homology is made based on 
finding levels of sequence similarity that 
are thought to be too high to be due to 



convergence (the exact threshold for 
such an inference is not well estab- 
lished). 

Improvements in database search 
programs have made the identification 
of likely homologs much faster, easier, 
and more reliable (Altschul et al. 1997; 
Henikoff et al. 1998). However, as dis- 
cussed above, in many cases the identi- 
fication of homologs is not sufficient to 
make specific functional predictions be- 
cause not all homologs have the same 
function. The available similarity-based 
functional prediction methods can be 
distinguished by how they choose the 
homolog whose function is most rel- 
evant to a particular uncharacterized 
gene (Table 1). Some methods are rela- 
tively simple — many researchers use the 
highest scoring homolog (as determined 
by programs like BLAST or BLAZE) as the 
basis for assigning function. While high-
est hit methods are very fast, can be au-
tomated readily, and are likely accurate
in many instances, they do not take ad- 
vantage of any information about how 
genes and gene functions evolve. For ex- 
ample, gene duplication and subsequent 
divergence of function of the duplicates 
can result in homologs with different
functions being present within one spe- 
cies. Specific terms have been created to 
distinguish homologs in these cases 
(Table 2): Genes of the same duplicate 
group are called orthologs (e.g., β-globin
from mouse and humans), and different
duplicates are called paralogs (e.g., α-
and β-globin) (Fitch 1970). Because gene
duplications are frequently accompa- 
nied by functional divergence, dividing 
genes into groups of orthologs and para- 
logs can improve the accuracy of func- 
tional predictions. Recognizing that the 
one-to-one sequence comparisons used 
by most methods do not reliably distin- 
guish orthologs from paralogs, Tatusov 
et al. (1997) developed the COG cluster- 
Table 1. Methods of Predicting
Gene Function When Homologs 
Have Multiple Functions 

Highest Hit 

The uncharacterized gene is 
assigned the function (or frequently, 
the annotated function) of the gene 
that is identified as the highest hit 
by a similarity search program (e.g., 
Tomb et al. 1997). 

Top Hits 

Identify top 10+ hits for the 
uncharacterized gene. Depending 
on the degree of consensus of the 
functions of the top hits, the query 
sequence is assigned a specific 
function, a general activity with 
unknown specificity, or no function 
(e.g., Blattner et al. 1997). 

Clusters of Orthologous Groups 

Genes are divided into groups of 
orthologs based on a cluster 
analysis of pairwise similarity scores 
between genes from different 
species. Uncharacterized genes are 
assigned the function of 
characterized orthologs (Tatusov et 
al. 1997). 

Phylogenomics 

Known functions are overlaid onto 
an evolutionary tree of all 
homologs. Functions of 
uncharacterized genes are predicted 
by their phylogenetic position 
relative to characterized genes (e.g., 
Eisen et al. 1995, 1997). 



ing method (see Table 1). Although the 
COG method is clearly a major advance 
in identifying orthologous groups of 
genes, it is limited in its power because 
clustering is a way of classifying levels of 
similarity and is not an accurate method 
of inferring evolutionary relationships 
(Swofford et al. 1996). Thus, as sequence 
similarity and clustering are not reliable 
estimators of evolutionary relatedness, 
and as the incorporation of such phylo- 
genetic information has been so useful 
to other areas of biology, evolutionary 
techniques should be useful for improv- 
ing the accuracy of predicting function 
based on sequence similarity. 
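
The difference between the highest hit and top hits rules in Table 1 can be illustrated with a small sketch. It assumes the search results are already available as (score, annotated function) pairs. The function names, the window of 10 hits, and the 0.8 consensus cutoff are illustrative assumptions, not values taken from any published annotation pipeline, and the sketch collapses the three possible top hits outcomes into "a function" or "no function".

```python
from collections import Counter

def highest_hit(hits):
    """Return the annotated function of the single best-scoring hit."""
    if not hits:
        return None
    return max(hits, key=lambda h: h[0])[1]

def top_hits_consensus(hits, n=10, cutoff=0.8):
    """Return the most common function among the top n hits,
    or None when its share of those hits is below cutoff."""
    top = sorted(hits, key=lambda h: h[0], reverse=True)[:n]
    if not top:
        return None
    function, count = Counter(f for _, f in top).most_common(1)[0]
    return function if count / len(top) >= cutoff else None

# Hypothetical search results: (similarity score, annotated function).
hits = [
    (310, "function A"),
    (295, "function B"),
    (290, "function B"),
    (120, "function B"),
]

print(highest_hit(hits))                     # function A
print(top_hits_consensus(hits))              # None (3 of 4 hits is below 0.8)
print(top_hits_consensus(hits, cutoff=0.7))  # function B
```

Neither rule uses any information about how the hits are related to one another, which is the limitation the rest of this commentary addresses.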

Phylogenomics 

There are many ways in which evolu- 



tionary information can be used to im- 
prove functional predictions. Below, I 
present an outline of one such phylog- 
enomic method (see Fig. 1), and I com- 
pare this method to nonevolutionary 
functional prediction methods. This 
method is based on a relatively simple 
assumption — because gene functions 
change as a result of evolution, recon- 
structing the evolutionary history of 
genes should help predict the functions
of uncharacterized genes. The first step 
is the generation of a phylogenetic tree 
representing the evolutionary history of 
the gene of interest and its homologs. 
Such trees are distinct from clusters and 
other means of characterizing sequence 
similarity because they are inferred by 
special techniques that help convert pat-
terns of similarity into evolutionary re-
lationships (see Swofford et al. 1996). Af-
ter the gene tree is inferred, biologically
determined functions of the various ho- 
mologs are overlaid onto the tree. Fi- 
nally, the structure of the tree and the 
relative phylogenetic positions of genes 
of different functions are used to trace 
the history of functional changes, which 
is then used to predict functions of un- 
characterized genes. More detail of this 
method is provided below. 

Identification of Homologs 

The first step in studying the evolution 
of a particular gene is the identification 
of homologs. As with similarity-based 
functional prediction methods, likely 
homologs of a particular gene are iden- 
tified through database searches. Be- 
cause phylogenetic methods benefit 
greatly from more data, it is useful to 
augment this initial list by using identi- 
fied homologs as queries for further 



database searches or using automatic it- 
erated search methods such as PSI- 
BLAST (Altschul et al. 1997). If a gene 
family is very large (e.g., ABC transport- 
ers), it may be necessary to only analyze 
a subset of homologs. However, this 
must be done with extreme care, as one 
might accidentally leave out proteins 
that would be important for the analy- 
sis. 

Alignment and Masking 

Sequence alignment for phylogenetic
analysis has a particular purpose — it is 
the assignment of positional homology. 
Each column in a multiple sequence 
alignment is assumed to include amino 
acids or nucleotides that have a com- 
mon evolutionary history, and each col- 
umn is treated separately in the phylo- 
genetic analysis. Therefore, regions in 
which the assignment of positional ho- 
mology is ambiguous should be ex- 
cluded (Gatesy et al. 1993). The exclu- 
sion of certain alignment positions (also 
known as masking) helps to give phylo- 
genetic methods much of their discrimi- 
natory power. Phylogenetic trees gener- 
ated without masking (as is done in 
many sequence analysis software pack- 
ages) are less likely to accurately reflect 
the evolution of the genes than trees 
with masking. 
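
As a rough illustration of masking, the sketch below drops alignment columns that contain too many gaps. Using gap fraction as the criterion for "ambiguous positional homology" is an assumption made for this example; in practice masks are often chosen by inspection or by other criteria. The function name and the toy sequences are hypothetical.

```python
def mask_alignment(aligned, max_gap_fraction=0.0):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    aligned maps sequence names to equal-length aligned strings ('-' = gap).
    Returns (masked alignment, list of kept column indices).
    """
    names = list(aligned)
    lengths = {len(aligned[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    kept = [
        i for i in range(n_cols)
        if sum(aligned[n][i] == "-" for n in names) / len(names) <= max_gap_fraction
    ]
    masked = {n: "".join(aligned[n][i] for i in kept) for n in names}
    return masked, kept

# Toy alignment (invented sequences).
alignment = {
    "gene1": "MKT-LLVA",
    "gene2": "MKTALLVA",
    "gene3": "MR--LLIA",
}

masked, kept = mask_alignment(alignment)
print(kept)    # [0, 1, 4, 5, 6, 7]
print(masked)  # {'gene1': 'MKLLVA', 'gene2': 'MKLLVA', 'gene3': 'MRLLIA'}
```

Only the kept columns would then be passed to the tree-building step.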

Phylogenetic Trees 

For extensive information about gener- 
ating phylogenetic trees from sequence 
alignments, see Swofford et al. (1996). In 
summary, there are three methods com- 
monly used: parsimony, distance, and 
maximum likelihood (Table 3), and each
has its advantages and disadvantages. I 



Table 2. Types of Molecular Homology

Homolog              Genes that are descended from a common ancestor
                     (e.g., all globins)

Ortholog             Homologous genes that have diverged from each other
                     after speciation events (e.g., human β- and chimp
                     β-globin)

Paralog              Homologous genes that have diverged from each other
                     after gene duplication events (e.g., β- and γ-globin)

Xenolog              Homologous genes that have diverged from each other
                     after lateral gene transfer events (e.g., antibiotic
                     resistance genes in bacteria)

Positional homology  Common ancestry of specific amino acid or nucleotide
                     positions in different genes

Figure 1 Outline of a phylogenomic methodology. In this method, information about the
evolutionary relationships among genes is used to predict the functions of uncharacterized
genes (see text for details). Two hypothetical scenarios are presented and the path of trying to
infer the function of two uncharacterized genes in each case is traced. (A) A gene family has
undergone a gene duplication that was accompanied by functional divergence. (B) Gene func-
tion has changed in one lineage. The true tree (which is assumed to be unknown) is shown at
the bottom. The genes are referred to by numbers (which represent the species from which these
genes come) and letters (which in A represent different genes within a species). The thin
branches in the evolutionary trees correspond to the gene phylogeny and the thick gray
branches in A (bottom) correspond to the phylogeny of the species in which the duplicate genes
evolve in parallel (as paralogs). Different colors (and symbols) represent different gene func-
tions; gray (with hatching) represents either unknown or unpredictable functions.



prefer distance methods because they 
are the quickest when using large data
sets. Before using any particular tree it is 
important to estimate the robustness 
and accuracy of the phylogenetic pat-
terns it shows (through techniques such
as the comparison of trees generated by 
different methods and bootstrapping). 
Finally, in most cases, it is also useful to 
determine a root for the tree. 
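
The resampling step of bootstrapping (see Table 3) can be sketched as follows. Alignment columns are drawn with replacement to build each replicate, a grouping is inferred from every replicate, and the percentage of replicates that recover each original group is reported. The `closest_pair` function is only a toy stand-in for real tree inference and is an assumption of this example, as are all names and sequences used here.

```python
import random
from itertools import combinations

def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement to build one replicate."""
    names = list(aligned)
    n_cols = len(aligned[names[0]])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {n: "".join(aligned[n][i] for i in cols) for n in names}

def closest_pair(aligned):
    """Toy stand-in for tree inference: the pair with the fewest mismatches."""
    def mismatches(a, b):
        return sum(x != y for x, y in zip(aligned[a], aligned[b]))
    pair = min(combinations(sorted(aligned), 2), key=lambda p: mismatches(*p))
    return {frozenset(pair)}

def bootstrap_support(aligned, infer_groups, n_reps=100, seed=0):
    """Percentage of replicates that recover each group found in the original data."""
    rng = random.Random(seed)
    reference = infer_groups(aligned)
    counts = {group: 0 for group in reference}
    for _ in range(n_reps):
        replicate_groups = infer_groups(bootstrap_alignment(aligned, rng))
        for group in reference:
            if group in replicate_groups:
                counts[group] += 1
    return {group: 100.0 * c / n_reps for group, c in counts.items()}

# Toy alignment (invented): gene1 and gene2 are identical, gene3 differs everywhere.
alignment = {
    "gene1": "ACGTACGT",
    "gene2": "ACGTACGT",
    "gene3": "TGCATGCA",
}

support = bootstrap_support(alignment, closest_pair, n_reps=50)
print(support)
```

With real data the signal is rarely this clean, and groups supported by only a few columns receive correspondingly lower percentages.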



Functional Predictions 

To make functional predictions based 
on the phylogenetic tree, it is necessary 
to first overlay any known functions 
onto the tree. There are many ways this 
"map" can then be used to make func- 
tional predictions, but I recommend 
splitting the task into two steps. First, 
the tree can be used to identify likely 
gene duplication events in the past. This 
allows the division of the genes into 
groups of orthologs and paralogs (e.g., 
Eisen et al. 1995). Uncharacterized genes
can be assigned a likely function if the 
function of any ortholog is known (and 
if all characterized orthologs have the 
same function). Second, parsimony re- 
construction techniques (Maddison and 
Maddison 1992) can be used to infer the 
likely functions of uncharacterized 
genes by identifying the evolutionary 
scenario that requires the fewest func- 
tional changes over time (Fig. 1). The in- 
corporation of more realistic models of 
functional change (and not just mini- 
mizing the total number of changes) 
may prove to be useful, but the parsi- 
mony minimization methods are prob- 
ably sufficient in most cases. 
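
The parsimony step can be sketched with a Fitch-style count of functional changes on a rooted binary tree. This is a minimal illustration under stated assumptions, not the procedure implemented in any particular software package: the tree is given as nested tuples, every change between functions costs the same, and an uncharacterized gene is predicted to have every function that is compatible with the minimum number of changes. All gene names and function labels are hypothetical.

```python
def leaves(tree):
    """List the leaf names of a tree given as nested (left, right) tuples."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def parsimony_score(tree, states_of):
    """Fitch bottom-up pass: return (candidate state set, number of changes)."""
    if isinstance(tree, str):
        return states_of(tree), 0
    left_states, left_changes = parsimony_score(tree[0], states_of)
    right_states, right_changes = parsimony_score(tree[1], states_of)
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1

def predict_functions(tree, known):
    """For each uncharacterized leaf, return the functions it can take
    without increasing the minimum number of functional changes."""
    all_states = frozenset(known.values())

    def score(forced):
        def states_of(leaf):
            if leaf in forced:
                return frozenset([forced[leaf]])
            if leaf in known:
                return frozenset([known[leaf]])
            return all_states  # unknown leaves place no constraint
        return parsimony_score(tree, states_of)[1]

    best = score({})
    return {
        leaf: {s for s in all_states if score({leaf: s}) == best}
        for leaf in leaves(tree) if leaf not in known
    }

# Two duplicate groups (A and B) with one uncharacterized gene in each.
tree = ((("1A", "2A"), "3A"), (("1B", "3B"), "2B"))
known = {"1A": "function_1", "3A": "function_1",
         "1B": "function_2", "2B": "function_2"}
print(predict_functions(tree, known))
# {'2A': {'function_1'}, '3B': {'function_2'}}

# A gene outside both characterized lineages cannot be resolved.
print(predict_functions(("u", ("a", "b")), {"a": "function_1", "b": "function_2"}))
# 'u' maps to both functions
```

A leaf that maps to more than one function corresponds to an unpredictable case: parsimony alone cannot choose among the equally short scenarios.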



Is the Phylogenomic Method Worth 
the Trouble? 

Phylogenomic methods require many
more steps and usually much more
manual labor than similarity-based
functional prediction methods. Is the 
phylogenomic approach worth the 
trouble? Many specific examples exist in 
which gene function has been shown to 
correlate well with gene phylogeny (Ei-
sen et al. 1995; Atchley and Fitch 1997).
Although no systematic comparisons of 
phylogenetic versus similarity-based 
functional prediction methods have 
been done, there are a variety of reasons 
to believe that the phylogenomic 
method should produce more accurate 
predictions than similarity-based meth-
ods. In particular, there are many condi-
tions in which similarity-based methods
are likely to make inaccurate predictions 
but which can be dealt with well by phy-
logenetic methods (see Table 4).

A specific example helps illustrate a 
potential problem with similarity-based 
methods. Molecular phylogenetic meth- 
ods show conclusively that mycoplas- 
mas share a common ancestor with low- 
GC Gram-positive bacteria (Weisburg et
Table 3. Molecular Phylogenetic Methods

Parsimony           Possible trees are compared and each is given a score that is a reflection of the minimum number
                    of character state changes (e.g., amino acid substitutions) that would be required over
                    evolutionary time to fit the sequences into that tree. The optimal tree is considered to be the
                    one requiring the fewest changes (the most parsimonious tree).

Distance            The optimal tree is generated by first calculating the estimated evolutionary distance between all
                    pairs of sequences. Then these distances are used to generate a tree in which the branch
                    patterns and lengths best represent the distance matrix.

Maximum likelihood  Maximum likelihood is similar to parsimony methods in that possible trees are compared and
                    given a score. The score is based on how likely the given sequences are to have evolved in a
                    particular tree given a model of amino acid or nucleotide substitution probabilities. The optimal
                    tree is considered to be the one that has the highest probability.

Bootstrapping       Alignment positions within the original multiple sequence alignment are resampled and new data
                    sets are made. Each bootstrapped data set is used to generate a separate phylogenetic tree and
                    the trees are compared. Each node of the tree can be given a bootstrap percentage indicating
                    how frequently those species joined by that node group together in different trees. Bootstrap
                    percentage does not correspond directly to a confidence limit.



al. 1989). However, examination of the 
percent similarity between mycoplasmal 
genes and their homologs in bacteria 
does not clearly show this relationship. 



This is because mycoplasmas have un- 
dergone an accelerated rate of molecular 
evolution relative to other bacteria. 
Thus, a BLAST search with a gene from 



Bacillus subtilis (a low GC Gram-positive 
species) will result in a list in which the 
mycoplasma homologs (if they exist) 
score lower than genes from many spe-
cies of bacteria less closely related to B.
subtilis. When amounts or rates of
change vary between lineages, phyloge- 
netic methods are better able to infer 
evolutionary relationships than similar- 
ity methods (including clustering) be- 
cause they allow for evolutionary 
branches to have different lengths.
Thus, in those cases in which gene func-
tion correlates with gene phylogeny and
in which amounts or rates of change 
vary between lineages, similarity-based 
methods will be more likely than phy- 
logenomic methods to make inaccurate 
functional predictions (see Table 4). 
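
The effect of rate variation can be shown with a toy calculation. The distances below are invented numbers that fit a tree in which the query and a fast-evolving homolog are true sisters, with the fast-evolving gene on a long branch. Picking the smallest distance selects the wrong relative, whereas the four-point condition on additive distances, which allows branches of different lengths, recovers the true pairing. Taxon labels and function names are hypothetical.

```python
def symmetric(pairs):
    """Build a symmetric distance lookup from {(a, b): distance}."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist

def nearest_by_distance(query, dist):
    """Similarity-style choice: the taxon with the smallest distance to query."""
    others = [t for t in dist if t != query]
    return min(others, key=lambda t: dist[query][t])

def quartet_sister(query, dist):
    """Four-point method: the sister of query among exactly four taxa.

    The pairing whose two within-pair distances sum to the smallest
    value is the one supported by an additive tree.
    """
    others = [t for t in dist if t != query]
    if len(others) != 3:
        raise ValueError("the four-point method needs exactly four taxa")

    def split_sum(partner):
        rest = [t for t in others if t != partner]
        return dist[query][partner] + dist[rest[0]][rest[1]]

    return min(others, key=split_sum)

# Invented distances from the tree ((B_subtilis:0.1, mycoplasma:1.0):0.2,
# species_C:0.1, species_D:0.15); the mycoplasma branch is accelerated.
dist = symmetric({
    ("B_subtilis", "mycoplasma"): 1.1,
    ("B_subtilis", "species_C"): 0.4,
    ("B_subtilis", "species_D"): 0.45,
    ("mycoplasma", "species_C"): 1.3,
    ("mycoplasma", "species_D"): 1.35,
    ("species_C", "species_D"): 0.25,
})

print(nearest_by_distance("B_subtilis", dist))  # species_C
print(quartet_sister("B_subtilis", dist))       # mycoplasma
```

Real data are never perfectly additive, so tree-building methods work with estimated distances or character data, but the contrast between the two answers is the point made in the text.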

Another major advantage of phyloge- 
netic methods over most similarity 
methods comes from the process of 
masking (see above). For example, a de- 
letion of a large section of a gene in one 
species will greatly affect similarity mea- 
sures but may not affect the function of 
that gene. A phylogenetic analysis in-
cluding these genes could exclude the
region of the deletion from the analysis 
by masking. In addition, regions of 
genes that are highly variable between 
species are more likely to undergo con- 
vergence and such regions can be ex- 
cluded from phylogenetic analysis by 
masking. Masking thus allows the exclu-
sion of regions of genes in which se-
quence similarity is likely to be "noisy"
or misleading rather than a biologically 
important signal. The pairwise sequence 
comparisons used by most similarity- 
based functional prediction methods do 
not allow such masking. Phylogenetic 
methods have been criticized because of 
their dependence (for most methods) on 
multiple sequence alignments that are
not always reliable and unbiased. How- 
ever, multiple sequence alignments also 
allow for masking, which is probably 
more valuable than the cost of depend- 
ing on alignments. 

The conditions described above and 
highlighted in Table 4 are just some ex- 
amples of conditions in which evolu- 
tionary methods are more likely to make 
accurate functional predictions than 
similarity-based methods. Phylogenetic 
methods are particularly useful when 
the history of a gene family includes 
many of these conditions (e.g., multiple 
gene duplications plus rate variation) or 
when the gene family is very large. The 
principle is simple — the more compli- 
cated the history of a gene family, the 
more useful it is to try to infer that his-
tory. Thus, although the phylogenomic



method is slow and labor intensive, I be- 
lieve it is worth using if accuracy is the 
main objective. In addition, informa-
tion about the evolutionary relation-
ships among gene homologs is useful for
summarizing relationships among genes 
and for putting functional information 
into a useful context. 

Despite the evolution of these meth-
ods and likely continued improvements,
it must be re-
membered that the key word is predic-
tion. All methods will make some in-
accurate predictions of functions. For
example, none of the methods described 
can perform well when gene functions 
can change with little sequence change 
as has been seen in proteins like opsins 
(Yokoyama 1997). Thus, sequence data- 
bases and genome researchers should 
make clear which functions assigned to 
genes are based on predictions and 
which are based on experiments. In ad- 
dition, all prediction methods should 
use only experimentally determined 
functions as their grist for predictions. 
This will hopefully limit error propaga- 
tion that can happen by using an inac- 
curate prediction of function to then 
predict the function of a new gene, 
which is a particular problem for the 
highest hit methods, as they rely on the 
function of only one gene at a time to 
make predictions (Eisen et al. 1997). De-
spite these and other potential prob-
lems, functional predictions are of great 
value in guiding research and in sorting 
through huge amounts of data. I believe 
that the increased use of phylogenetic 
methods can only serve to improve the 
accuracy of such functional predictions. 